Linear regression assumes that the response variable Y is quantitative. In many situations, however, the response variable is qualitative, and logistic regression is one method we can use in those cases. We generally refer to these types of variables as categorical variables; for example, eye color is categorical since it takes values such as brown, blue, and green. Classification is the task of assigning each observation to a specific class, and we usually do so by predicting the probability that the observation belongs to each class.
There are many classification techniques, or classifiers, that could be used to predict a given qualitative response variable. Examples covered in this notebook include:

- Logistic regression
- Linear discriminant analysis
- K-nearest neighbors
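As a quick illustration of the idea behind logistic regression, the model passes a linear predictor through the logistic (sigmoid) function so the output is always a valid probability in (0, 1). A minimal sketch (the name `sigmoid` is just for illustration; base R provides the same function as `plogis()`):

```r
# The logistic function maps any real-valued linear predictor eta
# onto a probability between 0 and 1.
sigmoid <- function(eta) 1 / (1 + exp(-eta))

sigmoid(0)            # a linear predictor of 0 corresponds to probability 0.5
sigmoid(c(-4, 0, 4))  # large negative values approach 0, large positive approach 1
```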
## 'data.frame': 48842 obs. of 15 variables:
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : Factor w/ 9 levels "?","Federal-gov",..: 5 5 3 5 1 5 1 7 5 5 ...
## $ fnlwgt : int 226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 2 12 8 16 16 1 12 15 16 6 ...
## $ educational.num: int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 3 3 5 5 5 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 8 6 12 8 1 9 1 11 9 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 4 1 1 1 4 2 5 1 5 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 3 5 5 3 5 5 3 5 5 5 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 2 2 1 2 ...
## $ capital.gain : int 0 0 0 7688 0 0 0 3103 0 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ native.country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 40 40 40 40 40 40 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 2 2 1 1 1 2 1 1 ...
We can see that the data have 48,842 observations of 15 variables. Let’s create a few diagnostic plots to get a sense of the data.
adultincome <- subset(
  adultincome,
  (race %in% c("White", "Black")) &
    (education %in% c("Masters", "Bachelors", "HS-grad", "Assoc-voc", "10th"))
)
ggplot(adultincome, aes(x = education, y = age)) +
  geom_bar(
    aes(fill = income), stat = "identity", color = "white",
    position = position_dodge(0.9)
  ) +
  facet_wrap(~gender)
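After subsetting on factor columns it is worth confirming which levels actually remain, since `subset()` keeps the unused levels in the factor definition. A quick sanity check:

```r
# droplevels() discards the factor levels excluded by subset(), so the
# cross-tabulation shows only the race and education levels we kept.
with(droplevels(adultincome), table(race, education))
```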
p <- plot_ly(adultincome, y = ~age, color = ~education, type = "box")
htmltools::tagList(list(p, p))
glm(gender ~ age, data=adultincome, family='binomial')
##
## Call: glm(formula = gender ~ age, family = "binomial", data = adultincome)
##
## Coefficients:
## (Intercept) age
## 0.39118 0.00928
##
## Degrees of Freedom: 28503 Total (i.e. Null); 28502 Residual
## Null Deviance: 35730
## Residual Deviance: 35640 AIC: 35640
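The coefficients above are on the log-odds scale, so a positive `age` coefficient means the odds of the second factor level (`Male`) rise slightly with age. To turn the linear predictor into a probability, apply the inverse logit; for example, for a 40-year-old, using the rounded coefficients reported above:

```r
# Predicted probability that gender == "Male" for age = 40,
# computed from the rounded coefficients in the output above.
eta <- 0.39118 + 0.00928 * 40  # linear predictor (log-odds)
plogis(eta)                    # inverse logit, roughly 0.68
```

Equivalently, `predict(fit, newdata = data.frame(age = 40), type = "response")` on the fitted model object gives the same probability without manual arithmetic.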